ABSTRACT

Evidence from a variety of sources suggests that systematic differences can
be found in the ratings given to student essays as a function not only of the
student's skills but also of aspects of both the student's background and the
background of the rater. Additionally, the nature of the prompt which provided the
central theme of the essay might bias the outcome of the ratings of that essay.
A study of ratings of fifth and sixth graders who wrote paragraph-long essays in
response to two topics presented either in written or pictorial form is presented.
Students were classified as Hispanic-surnamed or non-Hispanic-surnamed; of the four
teachers trained as raters using an objectively-based essay scoring scheme, two
represented a Hispanic cultural background and two a non-Hispanic background.
Results from a blind rating of 100 complete essays show that several of the rating
subscales were significantly influenced by an interaction between student ethnicity
and rater ethnicity, and several subscales by rater ethnicity alone. Student
ethnicity alone was not a significant main effect on any subscale. Prompt modality
was significant for one subscale, and interacted with rater ethnicity on one other.
The findings are interpreted as a direct indication of biased assessment.
Introduction

The evaluation of school children's prose writing poses special problems in
relation to bias in educational appraisal. Many factors have long been known to
have a major influence on the prose writing performance of minority pupils. The
literature on the issue of biases which occur in the judgment of students' written
work is much smaller, and has proved much more contradictory. Are there specific
aspects of non-native English writing style which undermine the usual procedures
for judging writing performance? Do raters who match the cultural background of the
writers whose work they judge arrive at different conclusions from raters who do
not share the same background? In the present paper, the results of a research
study involving both writers and readers from two different cultures are examined
in an attempt to partition out the sources of systematic bias in the evaluation of
writing.

Sources of Bias: Student Variables

An overarching concern in the literature about bias in writing has been the
isolation of sociocultural factors in students' backgrounds which contribute to
differences in performance. A half-century ago, Caldwell and Mowry (1933) demonstrated
that bilingual Hispanic children, because their use of language differed from that
of their monolingual English-speaking counterparts, were at a disadvantage when
evaluated by the essays they wrote; on objective examinations the differences were
not nearly as acute. Parallel findings emerge from the recent large-scale study by
White and Thomas (1981), who combined files of data regarding entering students in
the California State University and Colleges system to yield graphic comparisons of
total scores for 5,246 Whites, 585 Blacks, 449 Mexican-Americans, and 617 Asian-
Americans on two English placement exams. The first was the CSUC's own English
Placement Test; the second was the Test of Standard Written English from the College
Entrance Examination Board. Although no statistical analyses were presented,
profiles of the four distributions suggest that dialect interference or second-
language interference hurt the overall performance of the three minority samples
on both tests. Lay (1978) has shown that native-speaking Chinese students are
at a disadvantage in writing English prose because of the wide differences in
structure and phonology between English and Chinese. Rizzo and Villafane (1978) have
shown that a similar explanation applies to native Spanish-speaking students.

Many investigators of language have shown that structural aspects of both
oral and written language are significant in determining how children process the
world around them. Moreover, many of the rules which govern functions of sending
and receiving meaning using oral language are significantly different from those
for written expression (Olson, 1977). For the non-native speaker of English
the task of writing in English poses a particular problem because

...the surface structure of writing is an inadequate representation
of both the sound structure of the target language and its meaning.
Learning the underlying structure of the target language is as much
of a bootstrap operation as the initial process of learning a mother
tongue (Smith, 1975, p. 359).

One practical outcome of such a structural viewpoint is that students who fail
to acquire skills in the underlying structure of English might do passably well
with spoken English but probably will have great difficulty with writing. Another
factor not to be dismissed lightly is the attitudinal or psychological readiness
of the student to orient positively to the task of acquiring skills in a new
language (Cervantes, 1975; Lambert, Gardner, Barik, & Tunstall, 1963). Without
the necessary motivation and appropriate learning context, students may be unable
to let their knowledge of both the mother tongue and the new language interact to
their advantage.

Sources of Bias: Evaluation Variables 

Beyond the issues of students' involvement in languages lies an important
realm of educational and psychometric considerations having to do with the quantity
and quality of appraisal. The nature of the task, how it is interpreted by both
the student and the teacher, the tools with which the students' writing is
judged, and by whom, are all issues of import. In each of these lies the possibility
of systematically different patterns of response for students from culturally or
linguistically different groups. Each, then, may introduce its own bias into the
evaluation of writing. The purpose of the writing task usually given to students
in the classroom is to construct an essay following a particular prompt.¹ The
teacher seeks a sufficient amount of this writing to rate the quality of the
student's work. Exactly which elements are most important in that assessment of
writing often depends upon the persons creating the scoring system. Freedman
(1979) attempted to specify "definable parts" of student compositions which
influenced teacher judgments. She concluded that content, organization, and
language mechanics were the most important factors, in that order. The effect
of "weak" content was so powerful that it overshadowed teacher judgment in every
other category. The interaction of content quality judgments with the quality of
the writing prompt is one point where bias in assessment is possible.

The use of incompletely explicated scoring criteria introduces another
potential for bias in writing studies. In Rhodes-Hoover and Politzer's (1974)
study of teachers' attitudes toward Black rhetoric, teachers downgraded
compositions in the category of "language mechanics" because students failed to use
"superstandard" English. For example, if a student wrote, "I got there" as
opposed to, "I reached my destination," the passage was considered too colloquial.
Teachers not only gave their own interpretation of "usage" and "colloquial" but
also imposed an undocumentable degree of severity in their judgment that may or may
not have been intended by the scale.

¹ The prompt itself may contribute to systematic bias. Some students may not know
what the prompt represents because they do not completely understand the vocabulary
of the prompt in written form, or do not recognize the pictorial content (the palm-
tree vs. evergreen problem). Differences of an extreme nature are found in
recognition of three-dimensional objects in photographs or drawings between children
of developed and underdeveloped countries. Subtler problems of prompt recognizability
abound: one British picture recognition test for the primary grades depicts
electrical items common in England but totally unknown in America.

In a study comparing the syntactical characteristics of Mexican and Anglo-
American prose, Rodrigues (1978) asked educators whether they could detect "slight"
or "noticeable" differences. More Anglo-American educators found "noticeable"
differences than did Mexican-American raters. Bikson (1977) conducted a study of
differences in working lexicons of 72 lower grade and 72 upper grade White, Chicano
and Black elementary school students. Results showed that ethnically diverse
speakers made different kinds of lexical choices, particularly in the early grades.
The differences between the Anglo lexicon and either the Black or Chicano lexicon
were greater than the differences between the two minority lexicons. The study found
varying degrees of overlap between minority and Anglo word choice. The minority
students used a wider range of vocabulary than the Anglo group, but this "broader"
working vocabulary is not often valued by persons evaluating the speech of these
students.

Differences in classification of lexical terms between different linguistic
groups may have consequences for the selection of scoring criteria to evaluate the
writing of these groups. If we take concept classification tasks to be analogous
to organization tasks in the writing process, then the different strategies used
to associate words may reflect different preferred methods of essay organization.
If the scoring criteria implicitly prefer one type of content organization
strategy, such preference could result in bias against those students who adopt
alternative strategies. Two studies in particular seem to suggest that words are
sorted by different ethnic groups into categories according to different
classification strategies. Rissel (1978) studied the vocabulary-semantic relationship
for monolingual English speakers, monolingual Spanish speakers and Spanish/
English bilinguals living in New York and Puerto Rico to determine the classification
strategies of these groups. The study found that not only did the classification
strategies vary by linguistic group but that there appeared to be a relationship
between "amount of language dominance" and classification strategy. Spanish-dominant
bilinguals employed comparative criteria, whereas the more "balanced" bilinguals
used comparative classification for Spanish words and inclusive classification
for English. Stahl (1977) conducted a study comparing the "methods for
arrangement" of content used by Israeli students of European or Arabic extraction. He
found that those of European background tended to arrange the content in a
hierarchical or inclusive manner, whereas those of Arabic background tended to use
more associative or comparative techniques. An interesting aspect of his method
was that he gave higher points for hierarchical classification than for the use of
comparative methods. In the assessment of writing this would appear to be a
deliberate introduction of biased criteria into the scoring process. Contrary results
have been reported. In a study of syntactic patterns of lower and middle class
Chicanos, Garcia (1975/76) concluded that the Chicanos used the same basic patterns
found in American English, a conclusion also tendered by Rodrigues (1978). At
the same time, however, Garcia cited research demonstrating differences in the
morphological and phonological systems used by Chicanos and Anglos.

Recent informal evidence demonstrates the potency of systematic differences
among raters of writing. Hartwell (1981) found that older, more experienced
writers selected very different passages as exemplary of "professional writing"
than did college freshmen. The differences appear to be consistent along a number
of dimensions, including content, coherence, degree of complexity, and development.
Differences in rating of a written essay may also be related to the rater's own
level of cognitive complexity and integration (Sternglass, 1981). Rater
background has been found to influence how scoring criteria are interpreted and applied.
Follman and Anderson (1967) concluded that when raters shared similar backgrounds
with regard to education and opinions about what constitutes "good" writing, they
tended to agree on the ratings of essays more than raters who differed along these
dimensions.

Whether writing is assessed through normative-holistic means or through
differentiated judgments on dimensions of rhetorical quality, the scoring
"instrument" will always be a human judge. Consequently, no question about fairness,
validity or accuracy in writing assessment can be fully addressed without
reference to possible errors in judgment. The intention of writing assessment is to
generate information useful for diagnosis and/or remediation. When diagnostic
utility is of interest several other issues are pertinent. Diagnosis implies
performance profiles which in turn require a multidimensional view of the writing
skill domain. Questions about skill profiles are connected intimately to rater
behavior in assigning ratings. Scoring criteria are filtered through the
expectancies of raters, and the halo effect inflates inter-subscale correlations (Jaeger
& Freijo, 1975). The use of more and longer writing tasks only exacerbates this
phenomenon.

Rating scales may interact. It is common for writing score profiles to
include some attention to essay "mechanics"; variations along this dimension may
influence ratings on other dimensions. Ratings assigned to a writing sample on
such dimensions as "organization" or "use of supporting detail" may be assigned
differentially depending on the quality of mechanics within the essay. For
mechanically substandard work, this process might bring the assessment of other
dimensions of writing quality into line with the rater's impression of mechanics,
while if the level of mechanics is not so low as to call attention to itself, there
may be minimal confounding. However, across a given set of papers the net effect would
be correlated true and error components and concomitant inflation of inter-subscale
correlations. In a multitrait-multimethod factor analytic formulation the
expectation in general would be for negative correlations between mechanics "trait"
factors and ratings "method" factors. Quellmalz and Capell (1979) used multitrait-
multimethod confirmatory factor analyses to examine the discriminant validity of
subscales generated by analytic scoring rubrics and the comparative information yield
of alternative response modes for writing assessment (i.e., essay, paragraph and
selected-response). Their results indicated relatively high interconnections among
subscale content factors, as well as a general tendency for the shorter assessment
modes to generate less pure indicators of the subscale factors.

If non-native English speakers' English writing is easily distinguished from
that of native speakers on the dimension of mechanics, and if such group differences
contaminate other ratings assigned to non-native speakers, a straightforward form
of bias may be present. Ratings on other dimensions will be systematically
depressed, and the diagnostic utility of the writing appraisal undermined. The
present study was conducted to evaluate such bias in the context of variations of
ethnicity of both students and raters, and of prompts. Additionally, the nature
of the task presented to students in order to get them to write an essay was
varied systematically.

METHOD

Subjects

One hundred and thirty students from fifth- and sixth-grade monolingual English
classrooms in a moderately sized California school district were involved in this
study as a normal part of their classroom activities. These students were not
members of bilingual programs although some were involved in remedial "pull-out"
instruction. Of the 116 students who provided complete essays, half were Hispanic-
surnamed. Raters were four teachers hired during school vacation, of whom two
were Hispanic and two non-Hispanic. These raters were from different school
districts and had no other contact of any kind with the students in this sample.

Instruments 

The study used a standardized writing task with two topics, and a modified
scoring rubric, explained below, which has been shown to have
acceptable validity and reliability (Quellmalz & Capell, 1979). The packet
containing the essay writing task consisted of a face sheet for the student's name and
date, followed by two prompts and two lined response pages, totalling five pieces
of paper per handout. The prompts involved two topics, one a main street of a town
and the other a robot. Order of presentation of the prompts, and whether the
prompt was written or pictorial, was controlled for every participant. Written
prompts involved five lines of typewritten text, while picture prompts involved a
lead sentence and a full-page line drawing of the children's topics by a graduate
student artist. In both situations, the text concluded with the request that the
student write a paragraph about the topic presented. No other information was made
available to the student.

The raters reviewed these essays using the Center for the Study of Evaluation's
Factual Narrative scoring rubric, consisting of four primary subscales: General
Impression, Focus and Organization, Support, and Grammar and Mechanics. Each of
these was evaluated on a six-point scale, ranging from clear mastery of the
assignment to clear failure. For each of the six values on each of the four scales,
extensive guidelines for scoring were provided. The General Impression rating of the
essay is formed by considering all aspects of the effectiveness of composition,
including the remaining three rating criteria. The Focus and Organization
subscale handles such issues as logical progression, transitions, and topic
development. The Support subscale rates the use of specific supporting statements and
details. The Grammar and Mechanics subscale is used to evaluate the essay's
sentence construction, word usage, spelling and punctuation. In addition to an
overall rating from this last subscale, the extent of errors in each of the four
areas of Mechanics noted above is rated separately. The instructions of the CSE
scoring rubric make explicit that raters using factual scoring will likely find
that some qualities of an essay cannot be considered separate from others, but they
are also quite direct in indicating how any particular rating is to correspond to
the annotation supplied in the guidelines.

Procedure

Each child received one essay packet containing two essay prompts, one
pictorial and the other written, and ruled pages for the child's essays. The
package of essay prompts was administered in a single half-hour sitting by the
children's classroom teachers, and essays were collected and sent directly for
rating without further intervention in the classroom.

Each of the raters was given every essay packet in random order, but without
the face sheet and thus without identification of the name or ethnic background
of the student writers. Following five days of training and pilot testing on use
of the CSE rating scales, the four raters completed scoring of the 116 essay
packages which were complete and legible over a seven-day period. The resulting
32 ratings for each essay (four raters x eight subscales) were then analyzed by a
three-factor analysis of variance (student ethnicity x rater ethnicity x prompt
modality) with repeated measures on the second two factors (Winer, 1962), separately
for each subscale. Also collected from school district records were subtest totals
on the Comprehensive Tests of Basic Skills (CTBS), administered as part of the regular
testing program by the school district, for all students involved in the study.
These scores allowed the investigation of possible relationships between the
measures of writing capability and four aspects of students' intellectual
capacity: vocabulary, passage comprehension, language mechanics and expression.
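The factorial layout described above can be made concrete with a small sketch. The Python below is illustrative only: the records, the names `cell_means` and `interaction_contrast`, and every rating value are hypothetical and are not the study's data or the Winer (1962) procedure. It shows how ratings fall into student ethnicity x rater ethnicity cells and how a simple difference-of-differences summarizes the pattern an interaction term is designed to detect.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (student_ethnicity, rater_ethnicity, prompt_mode, rating).
# The values are invented for illustration; they are not the study's data.
ratings = [
    ("hispanic", "hispanic", "picture", 4),
    ("hispanic", "hispanic", "written", 3),
    ("hispanic", "non_hispanic", "picture", 3),
    ("hispanic", "non_hispanic", "written", 2),
    ("non_hispanic", "hispanic", "picture", 4),
    ("non_hispanic", "hispanic", "written", 3),
    ("non_hispanic", "non_hispanic", "picture", 5),
    ("non_hispanic", "non_hispanic", "written", 4),
]


def cell_means(records):
    """Average rating for each student ethnicity x rater ethnicity cell."""
    cells = defaultdict(list)
    for student, rater, _prompt, score in records:
        cells[(student, rater)].append(score)
    return {cell: mean(scores) for cell, scores in cells.items()}


def interaction_contrast(means):
    """Difference of differences across the 2 x 2 table of cell means.

    A value near zero means both rater groups separate the two student
    groups by the same amount; a large value is the kind of pattern a
    student x rater interaction term would pick up.
    """
    gap_non_hispanic_raters = (
        means[("non_hispanic", "non_hispanic")] - means[("hispanic", "non_hispanic")]
    )
    gap_hispanic_raters = (
        means[("non_hispanic", "hispanic")] - means[("hispanic", "hispanic")]
    )
    return gap_non_hispanic_raters - gap_hispanic_raters


means = cell_means(ratings)
contrast = interaction_contrast(means)
```

A contrast computed this way is only descriptive; whether it is statistically reliable is what the repeated-measures analysis of variance tests.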

RESULTS AND DISCUSSION 

Only essays with complete ratings were considered in the analysis; complete
data were available for the four primary subscales for 100 essays, and for the four
detail subscales for 74 essays. Average rater agreement across all subscales was
high for the two Hispanic raters (92.15%) and moderately good for the non-Hispanic
raters (65.46%). When all four raters were compared, average agreement on the
subscales was good (81.15%). These values were considered acceptable evidence
that the training of the raters had been satisfactory. To minimize potential
confounding from differences between the two topics, all scores were then
standardized within topic before further analysis.
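Standardizing within topic can be sketched as follows. The paper does not state which standardization was used, so this example assumes ordinary z-scores computed with the population standard deviation; the function name `standardize_within_topic` and the ratings are hypothetical.

```python
from statistics import mean, pstdev


def standardize_within_topic(scores_by_topic):
    """Convert raw ratings to z-scores separately for each topic.

    Assumption: z-scores with the population standard deviation. The
    source says only that scores were standardized within topic.
    """
    standardized = {}
    for topic, scores in scores_by_topic.items():
        center = mean(scores)
        spread = pstdev(scores)
        if spread == 0:
            # All ratings identical: treat every standardized score as zero.
            standardized[topic] = [0.0 for _ in scores]
        else:
            standardized[topic] = [(s - center) / spread for s in scores]
    return standardized


# Hypothetical raw ratings: the "robot" topic runs one point lower overall.
raw = {"main_street": [2, 3, 4, 5], "robot": [1, 2, 3, 4]}
z = standardize_within_topic(raw)
```

After this step each topic has a mean of zero, so a topic that simply drew lower ratings overall no longer shifts the comparison between student or rater groups.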

On the General Impression subscale, the interaction between student ethnicity
(Hispanic or non-Hispanic) and rater ethnicity (Hispanic or non-Hispanic) was
significant (F(1,98) = 6.51, MSerror = 13.37, p < .01). While the non-Hispanic student
essays received about the same General Impression scores from Hispanic raters as
the Hispanic student essays, the non-Hispanic raters significantly favored the
non-Hispanic student essays. No other main effect or interaction was significant
for this subscale. The interaction between student ethnicity and rater ethnicity
was also found on the Support subscale (F(1,98) = 4.02, MSerror = 31.48, p < .05), and
on the Mechanics subscale (F(1,98) = 7.18, MSerror = 36.42, p < .01). On the Support
subscale, the non-Hispanic student essays were again significantly favored by the
non-Hispanic raters. However, on the Mechanics subscale, the non-Hispanic raters
judged both student groups alike while the Hispanic raters gave the essays of the
non-Hispanic students significantly lower scores.

For the Focus subscale, a main effect of rater ethnicity (F(1,98) = 11.82, MSerror =
16.62, p < .001) and an interaction between rater ethnicity and prompt mode (picture
prompt or written prompt) (F(1,98) = 5.41, MSerror = 19.01, p < .01) were found. In
addition to the rater ethnicity by student ethnicity interactions, the Support
subscale yielded only a main effect of prompt modality (F(1,98) = 10.43, MSerror =
68.17, p < .001), and the Mechanics subscale yielded only a main effect of rater
ethnicity (F(1,98) = 13.45, MSerror = 36.42, p < .001). On the detail subscales of
Mechanics, only one effect emerged as significant: rater ethnicity as a factor
in Usage ratings (F(1,73) = 41.01, MSerror = 47.01, p < .001). No other detail
subscale showed any significant main effect or interaction. Table 1 summarizes the
findings across the four primary and the usage detail subscales by main effect and
interactions, and the results of post-hoc analyses.

When performance scores on the CTBS were compared, neither the Hispanic nor
the non-Hispanic students emerged as significantly more capable on any subscale than
the others. The results of the correlational study between student essay ratings
and the four selected scale scores from the CTBS can be summarized rapidly. Not
a single significant correlation appeared between any rating subscale and any
CTBS scale for this sample. Thus there appears to be no intrinsically overlapping
information between writing performance as judged on CSE's Factual Narrative rubric
and a sample of academic performance as judged on a multiple-choice examination.
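The correlational check described above can be illustrated with a short sketch. The paper does not name the coefficient used, so this example assumes the Pearson product-moment correlation; the function `pearson_r` and the two score lists are hypothetical and stand in for one rating subscale and one CTBS scale.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for six students (not the study's data).
essay_ratings = [3, 4, 2, 5, 3, 4]
ctbs_vocabulary = [52, 47, 50, 49, 55, 51]
r = pearson_r(essay_ratings, ctbs_vocabulary)
```

In the study this kind of coefficient would be computed for each pairing of a rating subscale with a CTBS scale and then tested for significance.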

Table 1

Summary of Statistically Significant (p<.05) Effects

                              General      Focus and                             Usage
Subscale:                     Impression   Organization   Support   Mechanics    Detail 1
N essays:                     100          100            100       100          74

Main Effects
  Student Ethnicity
  Rater Ethnicity                          *2                       *2           *2
  Prompt                                                  *3

Interactions
  Student x Rater             *4                          *4        *5
  Student x Prompt
  Rater x Prompt                           *6
  Student x Rater x Prompt

1 Remaining detail subscales show no significant effects.
2 Hispanic raters elevated relative to non-Hispanic raters.
3 Picture prompt elevated relative to written prompt.
4 Non-Hispanic raters + non-Hispanic student essays elevated relative to other
  combinations.
5 Hispanic raters + non-Hispanic student essays depressed relative to other
  combinations.
6 Non-Hispanic raters + Hispanic student essays elevated relative to other
  combinations.

The most important finding, repeated across three of the subscales, is that
the student ethnicity and rater ethnicity factors interact frequently and
substantively in the appraisal of students' written essays. Additionally, rater ethnicity
alone is also a significant factor in the ratings. These results point to three
conclusions. First, the evaluation of prose writing seems to be systematically
affected by factors which reflect different cultural backgrounds. It is important
to note that this effect does not emerge when essays are grouped solely by student
ethnicity; rather, the students of one or the other background were often judged
differently by raters who share that background than by raters who do not. Second,
these factors include (but are not limited to) a match or mismatch between raters'
and writers' preferred language styles, and to some extent the nature of the stimulus
used to initiate the writing sample. Note, however, that the three-factor
interaction between student ethnicity, rater ethnicity and type of prompt was not
observed for any of the subscales used. Third, the phenomenon of systematic
matching or mismatching of preferences and styles occurs despite the fact that
the evaluative scheme used is one with a high degree of objectivity, which would
be expected to minimize such matching relative to a more subjective rating scheme.

The nature of the judgment task is referenced point-for-point by the CSE scoring
rubric and thus no scale-free or endpoint-only continuum judgments were involved.
Additionally, because raters were blind not only to the names and ethnicities of
the essay writers but to the study's hypotheses and the proportional
representation of ethnicities within the sample, whatever matching occurred most likely stems
from recognition of and preference for certain subtle aspects of writing styles.

Some limitations of the present study deserve attention. There are many
possible secondary analyses of writing style, process and content which have not
been pursued here. No information about essay complexity or other linguistic
patterns is available from the present analysis. How creative, stereotyped, or
bizarre the particular essay is goes unremarked in the CSE scoring system. The
isolation of exact details within essay content or specific preferences of
individual raters was not within the purview of this investigation. Moreover,
there is a small possibility that systematic differences in handwriting mastery
contributed to the recognizability of student ethnicity and thus to the ratings
given, but this was not examined directly. None of these considerations is seen
as critical to the interpretation of the results presented above, in particular
because the expected outcome of the analyses of variance in such an instance would
necessarily be a main effect due to student ethnicity alone or a three-way
interaction between student ethnicity, rater ethnicity and prompt modality. Neither of
these effects emerged in the present study, but rather a pattern of findings
which strongly suggests that some complex form of bias is at work.

Bias in judgment is a phenomenon which obtains under a variety of circumstances,
some of which are intrinsic in the testing and evaluation process. The present
findings indicate that extrinsic factors must also be considered. In the case of
judgment of essays, where essay content has virtually limitless possibilities and
appraisal is of necessity at least partially subjective, the opportunity for
unintentional bias seems more likely. For the teacher or essay test administrator
seeking to limit bias to the absolute minimum, the mandate is: those who are to
perform the rating of the essays must be matched for appropriate backgrounds of the
students who write the essays and are judged.